Note: This page’s design, presentation and content have been created and enhanced using Claude (Anthropic’s AI assistant) to improve visual quality and educational experience.
Week 8 • Sub-Lesson 3

📄 Document Intelligence — PDFs, Tables, and Forms

Extracting structured data from research documents, the tools that do it best, and why complex tables remain the hardest unsolved problem

Overview

Research lives in PDFs. Supplementary data tables, scanned archival documents, preprints, journal papers, grant reports — the information researchers need most is often locked in formats that resist easy extraction. AI has made substantial progress here, but the gap between “I can read this PDF” and “I can reliably extract its data” is wider than most tools acknowledge.

This sub-lesson unpacks the technical reasons PDFs are hard, maps the current tool landscape with honest accuracy figures, identifies the complex table problem as the central unresolved challenge, and gives you a practical workflow for document AI in research settings.

📄 Why PDFs Are Harder Than They Look

🔎 The Technical Reality of PDF Structure

PDFs were designed for printing, not data exchange. They store text as positioned glyphs — not as structured content. A column of numbers in a table is not stored as a table: it is stored as individual text fragments at specific page coordinates. When an AI “reads” a PDF, it must first reconstruct the logical structure (paragraphs, headings, table cells, figure captions) from this positional information. This reconstruction is the hard part, and it is where most errors originate.
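To see this concretely, the short sketch below uses the pdfplumber library to print the raw positioned fragments a PDF actually stores for a page. The filename is illustrative; note that nothing in the output marks rows, columns, or cells — that structure has to be inferred.

```python
# Demonstration with pdfplumber (pip install pdfplumber): a PDF page yields
# text fragments with coordinates, not rows, columns, or cells.
# "table_page.pdf" is an illustrative filename.
import pdfplumber

with pdfplumber.open("table_page.pdf") as pdf:
    page = pdf.pages[0]
    for word in page.extract_words()[:10]:  # first ten fragments on the page
        # Each fragment is just text plus page coordinates
        print(f"{word['text']!r} at x={word['x0']:.0f}, y={word['top']:.0f}")
```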

There are three distinct PDF types, each with different extraction characteristics:

  • Native PDF (text layer): text is stored as text; extraction is relatively reliable. Most modern journal PDFs fall into this category.
  • Image PDF (scanned): the entire page is an image; extraction requires OCR before any structure reconstruction is possible. Older archival documents and some supplementary files are often image PDFs.
  • Mixed PDF: some pages have text layers, some are scanned — the most common and most error-prone type for research documents. A thesis that includes a scanned signature page is technically a mixed PDF.

Identifying your PDF type before choosing a tool is the single most useful habit you can develop for document extraction work.
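One way to build that habit is to automate the check. The sketch below is a heuristic using the pypdf library; the 20-character threshold and the filename are arbitrary illustrative choices, not standard values.

```python
# Heuristic PDF-type check with pypdf (pip install pypdf). The 20-character
# threshold is an arbitrary cut-off for "this page has a real text layer".
from pypdf import PdfReader

def classify_pdf(path: str) -> str:
    reader = PdfReader(path)
    pages_with_text = sum(
        1 for page in reader.pages
        if len((page.extract_text() or "").strip()) > 20
    )
    if pages_with_text == len(reader.pages):
        return "native"    # every page has a text layer
    if pages_with_text == 0:
        return "scanned"   # no text layer anywhere: OCR required
    return "mixed"         # some pages scanned, some native

print(classify_pdf("thesis.pdf"))  # illustrative filename
```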

🔧 The Tools Landscape

The field has moved quickly. The following table summarises the leading options as of 2025, drawing on the OmniDocBench benchmark (CVPR 2025) and independent evaluations.

| Tool | Best For | Accuracy (Simple Tables) | Accuracy (Complex Tables) | Local / Cloud | Free? |
|---|---|---|---|---|---|
| Docling (IBM) | Research PDFs, local processing | Strong | Good (struggles with row merging) | Local | Open-source |
| LlamaParse (LlamaIndex) | Highest accuracy overall | Excellent | Excellent | Cloud | Limited free tier |
| Marker | Scanned papers | Strong | Moderate | Local | Open-source |
| Unstructured | Diverse document types | Strong | Moderate | Cloud / Local | Open-source |
| Azure Document Intelligence | Enterprise scale | Good | Good | Cloud | Paid |

💡 Current Best Practice Recommendation

For most research use cases: Docling is the fastest local option and handles most standard academic PDFs well. For tables in supplementary materials, LlamaParse gives the best accuracy but requires API access. For scanned documents, Marker runs full OCR on every page. A hybrid approach — Docling for speed, LlamaParse for the tables that matter most — is the current best practice recommendation from benchmarking studies (OmniDocBench, CVPR 2025; note that top tools now exceed 94% on the original benchmark, and a harder v1.6 subset has been released to track continued progress).
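A minimal sketch of that hybrid routing, assuming the documented Docling and LlamaParse Python APIs (pip install docling, pip install llama-parse, plus a LlamaCloud API key). The hand-maintained set of table-critical files is purely illustrative — in practice you decide per document which route it takes.

```python
# Hybrid routing sketch: Docling locally by default, LlamaParse for files
# whose tables matter most. CRITICAL_TABLE_FILES is a hypothetical list.
from docling.document_converter import DocumentConverter  # pip install docling
from llama_parse import LlamaParse                        # pip install llama-parse

CRITICAL_TABLE_FILES = {"supplementary_table_s2.pdf"}  # hypothetical list

def extract_markdown(path: str) -> str:
    if path in CRITICAL_TABLE_FILES:
        # Cloud parse; reads LLAMA_CLOUD_API_KEY from the environment
        parser = LlamaParse(result_type="markdown")
        return "\n\n".join(doc.text for doc in parser.load_data(path))
    # Fast local parse for everything else
    return DocumentConverter().convert(path).document.export_to_markdown()
```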

📋 The Complex Table Problem

🔍 The Central Unresolved Challenge

A complex table is any table with merged cells, nested headers, multi-level row or column spans, footnotes referenced within cells, or values that span multiple rows. These are extremely common in research: ANOVA tables, regression output tables, pharmaceutical trial results, census data extracts. If you work with quantitative data, you work with complex tables.

Current performance: even the best tools (LlamaParse, Docling) break down on complex tables. The specific failure modes are well-documented:

  • Row merging: cells that span multiple rows are often split into duplicate entries or lost entirely
  • Multi-level headers: a header spanning three columns becomes three independent, disconnected headers with no parent-child relationship preserved
  • Footnote associations: the superscript “a” in a cell that refers to footnote “a” at the bottom of the table is lost — the association between cell and note is not reconstructed
  • Reading order: in some table layouts, the logical reading order (how a human reads the data) differs from the spatial order (how glyphs are positioned on the page); tools that follow spatial order produce structurally invalid output

Practical consequence: always verify extracted table data cell by cell for any table with merged cells or complex structure. Do not assume extraction accuracy for complex tables regardless of which tool you use.
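Parts of that verification can be partially automated. The sketch below runs simple pandas sanity tests that flag symptoms of the failure modes above — duplicated rows, unexpected shape, empty cells. The expected row and column counts come from your own visual inspection, and these checks supplement, not replace, cell-by-cell comparison.

```python
# Sanity checks on an extracted table (as a pandas DataFrame). These flag
# symptoms of common extraction failures but do not replace visual,
# cell-by-cell comparison against the PDF.
import pandas as pd

def sanity_check(df: pd.DataFrame, expected_rows: int, expected_cols: int) -> list[str]:
    problems = []
    if df.shape != (expected_rows, expected_cols):
        # Wrong shape often signals split or dropped spanned cells
        problems.append(f"shape {df.shape}, expected {(expected_rows, expected_cols)}")
    if df.duplicated().any():
        # Duplicate rows are a classic row-merging failure
        problems.append(f"{int(df.duplicated().sum())} duplicated rows")
    if df.isna().any().any():
        # Empty cells may be values lost from merged regions
        problems.append("empty cells present")
    return problems
```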

📌 The “Lost in the Middle” Problem

⚠️ A Long-Document Limitation Worth Knowing About

When you give an AI model a long document, it has historically paid more attention to content at the beginning and end than to content in the middle. This was documented systematically by Liu et al. (2024) in “Lost in the Middle: How Language Models Use Long Contexts” (Transactions of the Association for Computational Linguistics, arXiv:2307.03172) — a foundational finding that frontier long-context models have partially addressed but not eliminated. The U-shaped attention curve has flattened with newer training methods, but performance on locating specific information in the middle of long contexts is still measurably weaker than at the edges.

For researchers: if you are asking an AI to find specific information in a long paper — methodological details, a contradictory finding, a specific numerical result — do not assume that “1M-token context window” means equal attention to all 1M tokens. For important information, extract the relevant section and query it separately rather than relying on the AI to correctly locate it within the full document. This is the safer workflow even where the underlying limitation has improved.
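A sketch of that extract-then-query workflow, assuming the paper has already been converted to Markdown (for example, Docling's output) so sections can be located by heading. The heading regex is a naive heuristic — inspect the returned slice before relying on it.

```python
# Extract-then-query sketch. Assumes "#"-style Markdown headings; the
# regex is a naive heuristic, so check the slice before trusting it.
import re

def extract_section(markdown: str, heading: str) -> str:
    # re.split with a capture group alternates [preamble, heading, body, ...]
    parts = re.split(r"(?m)^(#{1,6} .+)$", markdown)
    for i in range(1, len(parts) - 1, 2):
        if heading.lower() in parts[i].lower():
            return parts[i + 1].strip()
    return ""

paper_md = open("paper.md", encoding="utf-8").read()  # illustrative filename
methods = extract_section(paper_md, "Methods")
# Query `methods` on its own, so the target text cannot sit in the middle
# of a very long context.
```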

📖 Reference

Liu, N. F., Lin, K., Hewitt, J., Paranjape, A., Bevilacqua, M., Petroni, F., & Liang, P. (2024). Lost in the Middle: How Language Models Use Long Contexts. Transactions of the Association for Computational Linguistics, 12, 157–173. arXiv:2307.03172

✍️ Scanned and Handwritten Documents

📝 Printed, Legible Handwriting

Frontier vision-language models now achieve roughly 96–98% character-level accuracy (~3–4% WER) on standard handwritten text benchmarks like IAM and RIMES — GPT-4o reports ~1.69% character error rate on RIMES (Crosilla et al., 2025, arXiv:2503.15195). Specialised open models like Allen AI's olmOCR-2-7B (Oct 2025, arXiv:2510.19817) match commercial systems on document OCR, though they are not primarily evaluated on handwriting. Accuracy drops into the 80–95% range on noisy, historical, or cursive handwriting.

📂 Archival Documents

Degraded scans, unusual page layouts, yellowing or staining, and mixed handwriting/print significantly degrade all models. Budget time for manual review on archival material — AI can accelerate the work but cannot replace careful human checking for documents where accuracy matters.

🌐 Multi-Language OCR

Most tools perform well on printed English and Western European languages. Performance is measurably lower for African scripts, many Asian scripts, and non-Latin alphabets. For research that involves documents in Zulu, Xhosa, Afrikaans, or other African languages, verify tool support explicitly before committing to a workflow.

📄 Practical Tool: Marker

Marker processes every page through full OCR, making it robust to scanned documents, albeit at the cost of processing speed. For archival research where you cannot guarantee that documents have a text layer, Marker is the safest default choice among the open-source options.

🚀 Document AI in Research Workflows

The following workflow reflects best practices for integrating document AI into academic research — prioritising accuracy verification at each stage where errors are most likely to occur.

  1. Identify your document type first (native PDF, scanned, or mixed) — this single decision determines which tool to use and what accuracy to expect. Open the PDF, zoom in, and try selecting text: if text is selectable, it has a text layer; if not, it is a scanned image.
  2. For native PDFs with standard structure: Docling works well — it is free, runs locally, and produces clean Markdown output. Install via pip install docling; the Python API is straightforward for batch processing (see the sketch after this list).
  3. For complex tables in supplementary files: use LlamaParse and verify every extracted table against the visual layout. LlamaParse’s premium mode uses a multimodal model to “see” the table rather than parse glyphs, which substantially improves complex table accuracy.
  4. For scanned documents: Marker or Azure Document Intelligence; budget time for manual checking of all tables and numerical values.
  5. For long documents with specific target information: extract the relevant section (the methods section, a specific table, a particular results subsection) before querying rather than querying the full document. This directly mitigates the “lost in the middle” effect.
  6. Always check extracted tables against the visual layout: open the PDF alongside the extracted output and compare. For complex tables, check every cell. For simple tables, spot-check at least the header row and first and last data rows.
  7. For batch processing multiple papers: start with a sample of 5–10 papers for which you know the correct values, and measure accuracy before scaling to the full corpus. Discovering a systematic error after processing 200 papers is far more costly than discovering it after 10.
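As referenced in step 2, here is a minimal batch-processing sketch using Docling's Python API. The directory names ("papers", "extracted") are illustrative.

```python
# Batch conversion with Docling: one Markdown file per input PDF.
from pathlib import Path
from docling.document_converter import DocumentConverter

converter = DocumentConverter()  # reuse one converter across the batch
out_dir = Path("extracted")
out_dir.mkdir(exist_ok=True)

for pdf in sorted(Path("papers").glob("*.pdf")):
    result = converter.convert(pdf)
    target = out_dir / f"{pdf.stem}.md"
    target.write_text(result.document.export_to_markdown(), encoding="utf-8")
    print(f"converted {pdf.name} -> {target}")
```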

📚 Readings

📋 Core Reading

Liu et al. (2024). “Lost in the Middle: How Language Models Use Long Contexts.” Transactions of the Association for Computational Linguistics.
https://arxiv.org/abs/2307.03172

📊 Supplementary Reading

OmniDocBench (CVPR 2025) — the most comprehensive 2025 benchmark for document parsing. Note that top tools now exceed 94% on the original benchmark; a harder v1.6 subset was released to track continued progress on nested tables, dense formulas, and unconventional layouts.
https://github.com/opendatalab/OmniDocBench

✅ Sub-Lesson 3 Summary

PDF structure: PDFs store text as positioned glyphs, not structured content. Three types — native, scanned, and mixed — require different tools and yield different accuracy levels.

Tool landscape: Docling (best local option for standard academic PDFs), LlamaParse (best overall accuracy, cloud), Marker (best for scanned documents), Unstructured (versatile). A hybrid approach is the current recommended practice.

Complex tables: The central unsolved problem. Row merging, multi-level headers, footnote associations, and reading order all cause systematic failures even in the best tools. Always verify complex table extractions cell by cell.

Lost in the middle: AI models attend less to content in the middle of long documents. Extract specific sections before querying rather than querying full documents.

Up next — Sub-Lesson 4: We move from documents to audio. Transcription AI, word error rates, South African language performance, and qualitative analysis software.